In this tutorial, we will cover some basics about how to apply network analyses to ecological communities. After this practical you should be able to know how to construct a network, represent it, and perform basic analyses. Keep in mind that this tutorial is just an introduction to the topic of network analyses, if you are more interested, there are plenty of online materials but also do not hesitate to contact us for more.
All this tutorial is run in R and RStudio. However, DON’T PANIC! this is not an R course and we won’t evaluate your R skills, you should be able to follow by using this html document.
Network analysis is a set of techniques that can be used to investigate the causes and consequences of complex social and ecological interactions among the species within a community (Farine & Whitehead, 2015). Networks enable the visualization of complex, multidimensional data as well as provide diverse statistical indices for interpreting the resultant graphs (Jones et al., 2018). This set of techniques were first developed in the 1930s by social scientists (Wasserman & Faust, 1994), to investigate the link between local patterns of human relationships and social processes, such as the impact of social groups on the likelihood of being obese (Christakis & Fowler, 2007).
Network analysis is based on graph theory and provides a flexible framework for analysing association or interaction data to address a broad set of biological questions, such as:
The term ‘network’ can mean different things to different people. Here, we will define a network as a description for the observed patterns of associations or interactions between species or conspecifics (Farine & Whitehead, 2015).
Networks consist of nodes connected by edges:
Figure 1. Network representation and its components.
We can represent networks in two ways.
Rows and columns represent nodes (n x n).
Values of the matrix represent the weight of each edge.
Diagonals represent self-edges (usually equal to 0 in ecological communities).
Symmetrical (undirected networks) or asymmetrical (directed networks).
Figure 2. Different methods for network representation.
Are the network and the adjacency matrix from Figure 2 directed or undirected?
The first step of any data analysis is to load the data. Here, we will be using a small network of interactions from the movie Star Wars Episode IV. Each node is a character and each edge indicates whether they appeared together in a scene of the movie or not. Edges here are thus undirected and they also have weights attached, since they can appear in multiple scenes together.
The first step of any r script is to install and/or load
the libraries that we are going to use. To install packages we will use
the function install.packages which allows us to install
any r packages that we ask for. To load these packages we
are going to use the function library which allows us to
call the r packages that we have installed in our
computer.
# install.packages("igraph")
# install.packages("tidyverse")
# install.packages("DT")
library(igraph)
library(tidyverse)
library(DT) # visualize tables in Rmarkdown (this package doesn't need to be loaded by you)Then we load the data into r. We will do this using the
function read.csv. Notice that we load the list of edges
and nodes separately.
edges <- read.csv("data/star-wars-network-edges.csv")
nodes <- read.csv("data/star-wars-network-nodes.csv")Now we can have a look at the data
edges %>%
DT::datatable()nodes %>%
DT::datatable()How many characters appear in the movie?
How many unique interactions do we have?
How many times C-3PO and R2-D2 appeared in scenes together?
So, how do we convert these two datasets into a network object in R?
There are multiple packages to work with networks, but the most popular
is igraph because it’s very flexible and easy to use. Other
packages that you may want to explore are sna and
networks.
To create a network we need to create an igraph object.
To do that, we will use the function graph_from_data_frame,
which has two arguments: d, a data frame with the edge
list; and vertices, a data frame with node data with the
node label in the first column.
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)
g## IGRAPH a55215c UNW- 22 60 --
## + attr: name (v/c), id (v/n), weight (e/n)
## + edges from a55215c (vertex names):
## [1] R2-D2 --C-3PO R2-D2 --LUKE R2-D2 --OBI-WAN
## [4] R2-D2 --LEIA R2-D2 --HAN R2-D2 --CHEWBACCA
## [7] R2-D2 --DODONNA CHEWBACCA --OBI-WAN CHEWBACCA --C-3PO
## [10] CHEWBACCA --LUKE CHEWBACCA --HAN CHEWBACCA --LEIA
## [13] CHEWBACCA --DARTH VADER CHEWBACCA --DODONNA LUKE --CAMIE
## [16] CAMIE --BIGGS LUKE --BIGGS DARTH VADER--LEIA
## [19] LUKE --BERU BERU --OWEN C-3PO --BERU
## [22] LUKE --OWEN C-3PO --LUKE C-3PO --OWEN
## + ... omitted several edges
Let’s interpret the output together
U means that the network is undirected.N means named graph.W means weighted graph.22 is the number of nodes.60 is the number of edges.name (v/c) means name is a node attribute and
it’s a character.weight (e/n) means weight is an edge attribute
and it’s numeric.We also have a bunch of functions to access to specific elements within the igraph object:
V(g) # nodes## + 22/22 vertices, named, from a55215c:
## [1] R2-D2 CHEWBACCA C-3PO LUKE DARTH VADER CAMIE
## [7] BIGGS LEIA BERU OWEN OBI-WAN MOTTI
## [13] TARKIN HAN GREEDO JABBA DODONNA GOLD LEADER
## [19] WEDGE RED LEADER RED TEN GOLD FIVE
V(g)$name # names of each node## [1] "R2-D2" "CHEWBACCA" "C-3PO" "LUKE" "DARTH VADER"
## [6] "CAMIE" "BIGGS" "LEIA" "BERU" "OWEN"
## [11] "OBI-WAN" "MOTTI" "TARKIN" "HAN" "GREEDO"
## [16] "JABBA" "DODONNA" "GOLD LEADER" "WEDGE" "RED LEADER"
## [21] "RED TEN" "GOLD FIVE"
vertex_attr(g) # all attributes of the nodes## $name
## [1] "R2-D2" "CHEWBACCA" "C-3PO" "LUKE" "DARTH VADER"
## [6] "CAMIE" "BIGGS" "LEIA" "BERU" "OWEN"
## [11] "OBI-WAN" "MOTTI" "TARKIN" "HAN" "GREEDO"
## [16] "JABBA" "DODONNA" "GOLD LEADER" "WEDGE" "RED LEADER"
## [21] "RED TEN" "GOLD FIVE"
##
## $id
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
E(g) # edges## + 60/60 edges from a55215c (vertex names):
## [1] R2-D2 --C-3PO R2-D2 --LUKE R2-D2 --OBI-WAN
## [4] R2-D2 --LEIA R2-D2 --HAN R2-D2 --CHEWBACCA
## [7] R2-D2 --DODONNA CHEWBACCA --OBI-WAN CHEWBACCA --C-3PO
## [10] CHEWBACCA --LUKE CHEWBACCA --HAN CHEWBACCA --LEIA
## [13] CHEWBACCA --DARTH VADER CHEWBACCA --DODONNA LUKE --CAMIE
## [16] CAMIE --BIGGS LUKE --BIGGS DARTH VADER--LEIA
## [19] LUKE --BERU BERU --OWEN C-3PO --BERU
## [22] LUKE --OWEN C-3PO --LUKE C-3PO --OWEN
## [25] C-3PO --LEIA LUKE --LEIA LEIA --BERU
## [28] LUKE --OBI-WAN C-3PO --OBI-WAN LEIA --OBI-WAN
## + ... omitted several edges
E(g)$weight # weights for each edge## [1] 17 13 6 5 5 3 1 7 5 16 19 11 1 1 2 2 4 1 3 3 2 3 18 2 6
## [26] 17 1 19 6 1 2 1 7 9 26 1 1 6 1 1 13 1 1 1 1 1 1 2 1 1
## [51] 3 3 1 1 3 1 2 1 1 1
edge_attr(g) # all attributes of the edges## $weight
## [1] 17 13 6 5 5 3 1 7 5 16 19 11 1 1 2 2 4 1 3 3 2 3 18 2 6
## [26] 17 1 19 6 1 2 1 7 9 26 1 1 6 1 1 13 1 1 1 1 1 1 2 1 1
## [51] 3 3 1 1 3 1 2 1 1 1
g[] # adjacency matrix## 22 x 22 sparse Matrix of class "dgCMatrix"
##
## R2-D2 . 3 17 13 . . . 5 . . 6 . . 5 . . 1 . . . . .
## CHEWBACCA 3 . 5 16 1 . . 11 . . 7 . . 19 . . 1 . . . . .
## C-3PO 17 5 . 18 . . 1 6 2 2 6 . . 6 . . . . . 1 . .
## LUKE 13 16 18 . . 2 4 17 3 3 19 . . 26 . . 1 1 2 3 1 .
## DARTH VADER . 1 . . . . . 1 . . 1 1 7 . . . . . . . . .
## CAMIE . . . 2 . . 2 . . . . . . . . . . . . . . .
## BIGGS . . 1 4 . 2 . 1 . . . . . . . . . 1 2 3 . .
## LEIA 5 11 6 17 1 . 1 . 1 . 1 1 1 13 . . . . . 1 . .
## BERU . . 2 3 . . . 1 . 3 . . . . . . . . . . . .
## OWEN . . 2 3 . . . . 3 . . . . . . . . . . . . .
## OBI-WAN 6 7 6 19 1 . . 1 . . . . . 9 . . . . . . . .
## MOTTI . . . . 1 . . 1 . . . . 2 . . . . . . . . .
## TARKIN . . . . 7 . . 1 . . . 2 . . . . . . . . . .
## HAN 5 19 6 26 . . . 13 . . 9 . . . 1 1 . . . . . .
## GREEDO . . . . . . . . . . . . . 1 . . . . . . . .
## JABBA . . . . . . . . . . . . . 1 . . . . . . . .
## DODONNA 1 1 . 1 . . . . . . . . . . . . . 1 1 . . .
## GOLD LEADER . . . 1 . . 1 . . . . . . . . . 1 . 1 1 . .
## WEDGE . . . 2 . . 2 . . . . . . . . . 1 1 . 3 . .
## RED LEADER . . 1 3 . . 3 1 . . . . . . . . . 1 3 . 1 .
## RED TEN . . . 1 . . . . . . . . . . . . . . . 1 . .
## GOLD FIVE . . . . . . . . . . . . . . . . . . . . . .
g[1,] # first row of adjacency matrix## R2-D2 CHEWBACCA C-3PO LUKE DARTH VADER CAMIE
## 0 3 17 13 0 0
## BIGGS LEIA BERU OWEN OBI-WAN MOTTI
## 0 5 0 0 6 0
## TARKIN HAN GREEDO JABBA DODONNA GOLD LEADER
## 0 5 0 0 1 0
## WEDGE RED LEADER RED TEN GOLD FIVE
## 0 0 0 0
How can we visualize this network? We can use the function
plot() from base r, which can produce a decent
output:
par(mar=c(0,0,0,0))
plot(g)As you can see, there is a lot of room for improvement in this
figure. Similarly, to any r function, we can check all the
available plotting options, using this code
?igraph.plotting. Let’s improve the graph a bit.
par(mar=c(0,0,0,0))
plot(g,
vertex.color = "grey", # change color of nodes
vertex.label.color = "black", # change color of labels
vertex.label.cex = .75, # change size of labels to 75% of original size
edge.curved=.25, # add a 25% curve to the edges
edge.color="grey20") # change edge color to greyIn some cases we might want to modify some of the plotting attributes
according to some properties of the network. For example, we could
change the size of the nodes and node labels so that they match their
importance. In our data set, strength
corresponds to the number of scenes the characters appear in.
V(g)$size <- strength(g)
par(mar=c(0,0,0,0))
plot(g)The plot is a bit messy, and we can’t actually see all the characters. One thing we can do is to show only those characters that appear in 10 or more scenes.
# taking the log to improve the visualisation
V(g)$size <- log(strength(g)) * 4 + 3
V(g)$label <- ifelse( strength(g)>=10, V(g)$name, NA )
par(mar=c(0,0,0,0))
plot(g)# what does `ifelse` do?
nodes$name=="R2-D2"## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
ifelse(nodes$name=="R2-D2", "yes", "no")## [1] "yes" "no" "no" "no" "no" "no" "no" "no" "no" "no" "no" "no"
## [13] "no" "no" "no" "no" "no" "no" "no" "no" "no" "no"
ifelse(grepl("R", nodes$name), "yes", "no")## [1] "yes" "no" "no" "no" "yes" "no" "no" "no" "yes" "no" "no" "no"
## [13] "yes" "no" "yes" "no" "no" "yes" "no" "yes" "yes" "no"
We can also change the colors of each node based on what side they’re in (dark side or light side).
# create vectors with characters in each side
dark_side <- c("DARTH VADER", "MOTTI", "TARKIN")
light_side <- c("R2-D2", "CHEWBACCA", "C-3PO", "LUKE", "CAMIE", "BIGGS", "LEIA", "BERU", "OWEN", "OBI-WAN", "HAN", "DODONNA", "GOLD LEADER", "WEDGE", "RED LEADER", "RED TEN", "GOLD FIVE")
other <- c("GREEDO", "JABBA")
# node we'll create a new color variable as a node property
V(g)$color <- NA
V(g)$color[V(g)$name %in% dark_side] <- "red"
V(g)$color[V(g)$name %in% light_side] <- "gold"
V(g)$color[V(g)$name %in% other] <- "grey20"
vertex_attr(g)## $name
## [1] "R2-D2" "CHEWBACCA" "C-3PO" "LUKE" "DARTH VADER"
## [6] "CAMIE" "BIGGS" "LEIA" "BERU" "OWEN"
## [11] "OBI-WAN" "MOTTI" "TARKIN" "HAN" "GREEDO"
## [16] "JABBA" "DODONNA" "GOLD LEADER" "WEDGE" "RED LEADER"
## [21] "RED TEN" "GOLD FIVE"
##
## $id
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
##
## $size
## [1] 18.648092 19.572539 19.635532 22.439250 12.591581 8.545177 13.556229
## [8] 19.310150 11.788898 11.317766 18.567281 8.545177 12.210340 20.528107
## [15] 3.000000 3.000000 9.437752 9.437752 11.788898 13.259797 5.772589
## [22] -Inf
##
## $label
## [1] "R2-D2" "CHEWBACCA" "C-3PO" "LUKE" "DARTH VADER"
## [6] NA "BIGGS" "LEIA" NA NA
## [11] "OBI-WAN" NA "TARKIN" "HAN" NA
## [16] NA NA NA NA "RED LEADER"
## [21] NA NA
##
## $color
## [1] "gold" "gold" "gold" "gold" "red" "gold" "gold" "gold"
## [9] "gold" "gold" "gold" "red" "red" "gold" "grey20" "grey20"
## [17] "gold" "gold" "gold" "gold" "gold" "gold"
par(mar=c(0,0,0,0))
plot(g)# what does %in% do?
1 %in% c(1,2,3,4)## [1] TRUE
1 %in% c(2,3,4)## [1] FALSE
If we want to indicate what the colors correspond to, we can add a legend.
par(mar=c(0,0,0,0))
plot(g)
legend(x=.75, y=.75,
legend=c("Dark side", "Light side", "Other"),
pch=21, pt.bg=c("red", "gold", "grey20"),
pt.cex=2, bty="n")Edge properties can also be modified. For example, we can change the width of each edge, to make it a function of the log number of scenes those two characters appear together.
E(g)$width <- log(E(g)$weight) + 1
edge_attr(g)## $weight
## [1] 17 13 6 5 5 3 1 7 5 16 19 11 1 1 2 2 4 1 3 3 2 3 18 2 6
## [26] 17 1 19 6 1 2 1 7 9 26 1 1 6 1 1 13 1 1 1 1 1 1 2 1 1
## [51] 3 3 1 1 3 1 2 1 1 1
##
## $width
## [1] 3.833213 3.564949 2.791759 2.609438 2.609438 2.098612 1.000000 2.945910
## [9] 2.609438 3.772589 3.944439 3.397895 1.000000 1.000000 1.693147 1.693147
## [17] 2.386294 1.000000 2.098612 2.098612 1.693147 2.098612 3.890372 1.693147
## [25] 2.791759 3.833213 1.000000 3.944439 2.791759 1.000000 1.693147 1.000000
## [33] 2.945910 3.197225 4.258097 1.000000 1.000000 2.791759 1.000000 1.000000
## [41] 3.564949 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.693147
## [49] 1.000000 1.000000 2.098612 2.098612 1.000000 1.000000 2.098612 1.000000
## [57] 1.693147 1.000000 1.000000 1.000000
par(mar=c(0,0,0,0))
plot(g)If you noticed, every time we visualise the network with the function
plot, the nodes appear in a different location. This is
because it’s running a probabilistic function trying to locate them in
the optimal way possible.
However, we can also specify the layout for the plot;
that is, the (x,y) coordinates where each node will be placed.
igraph has a few different layouts built-in, that will use
different algorithms to find an optimal distribution of
nodes. The following code illustrates some of these:
par(mfrow=c(2, 3), mar=c(0,0,1,0))
plot(g, layout=layout_randomly, main="Random")
plot(g, layout=layout_in_circle, main="Circle")
plot(g, layout=layout_as_star, main="Star")
plot(g, layout=layout_as_tree, main="Tree")
plot(g, layout=layout_on_grid, main="Grid")
plot(g, layout=layout_with_fr, main="Force-directed")Note that each of these is actually just a matrix of (x,y) locations for each node.
l <- layout_randomly(g)
str(l)## num [1:22, 1:2] -0.9 -0.309 -0.967 0.744 -0.164 ...
The most popular layouts are force-directed.
These algorithms, such as Fruchterman-Reingold, try to position the
nodes so that the edges have similar length and there are as few
crossing edges as possible. The idea is to generate “clean” layouts,
where nodes that are closer to each other share more connections in
common that those that are located further apart. Note that this is a
random algorithm, it might change every time we run it or among
different computers. If we want to make it consistent, we will use the
function set.seed, which allows us to control the random
process and make it consistent through time. That is, if we
set.seed and we add a number to it
set.seed(777) if we run the plot in different computers and
at different times the plot will be the same. If we change the number,
the plot will change. Let’s see an example.
par(mfrow=c(1,2))
set.seed(777)
fr <- layout_with_fr(g, niter=1000)
par(mar=c(0,0,0,0))
plot(g, layout=fr)
set.seed(666)
fr <- layout_with_fr(g, niter=1000)
par(mar=c(0,0,0,0))
plot(g, layout=fr)Using the Game of Thrones data set that you will be given plot the network of the characters from Game of Thrones using the most optimal display.
Ok, so now we have seen the very basics of how to visualise a network. While visualising the graph already gives us some information, we can use the network to do much more. For example, what are the most important species (nodes) in a network? What is the propensity of two species (nodes) that are connected to both be related to a third one? What are the different hidden communities in a network? So now we are going to have a look at some of these properties using the Star Wars network.
Studies use network metrics to estimate differences between individuals in their placement within a social network (at the node level). In animal networks, the most common are based on measures of importance or centrality. This means that with the code below we can measure, in different ways, the importance of a given node (species), or, in our case, the characters of the movie.
The most basic measure is degree, that can be
calculated with the function degree. The binary degree is
the count of the number of edges connected to the node. If a network is
directed, degree can be the partitioned into in-degree and out-degree,
representing the number of incoming and outgoing edges, respectively.
You can find these using mode="in" or
mode="out" or mode="total". This measure
captures the gregariousness of individuals, in terms of the number of
associates or interaction partners. In the Star Wars network, it will be
the unique number of characters that each character is interacting with.
We use the function sort to order the values, from low to
high.
sort(degree(g))## GOLD FIVE GREEDO JABBA CAMIE RED TEN OWEN
## 0 1 1 2 2 3
## MOTTI TARKIN BERU DARTH VADER DODONNA GOLD LEADER
## 3 3 4 5 5 5
## WEDGE R2-D2 BIGGS OBI-WAN RED LEADER CHEWBACCA
## 5 7 7 7 7 8
## HAN C-3PO LEIA LUKE
## 8 10 12 15
Strength is the sum of all edge weights connected to the node (if all edges have a weight of 1, then the strength will equal the binary degree). Strength can also be partitioned into in-strength and out-strength for directed networks. This measure typically represents the expected total interaction or association rate per sample. For example, a node with a strength of 2 would be expected to be observed with approximately two other individuals on average (if using most association indices).
sort(strength(g))## GOLD FIVE GREEDO JABBA RED TEN CAMIE MOTTI
## 0 1 1 2 4 4
## DODONNA GOLD LEADER OWEN BERU WEDGE TARKIN
## 5 5 8 9 9 10
## DARTH VADER RED LEADER BIGGS OBI-WAN R2-D2 LEIA
## 11 13 14 49 50 59
## CHEWBACCA C-3PO HAN LUKE
## 63 64 80 129
Why is the order of the degree and the strength so different?
Closeness measures how many steps are required to access every other node from a given node. It’s a measure of how long information takes to arrive (who hears news first?). So it attempts to capture the notion that a vertex is ‘central’ if it is ‘close’ to many other vertices. Higher values mean less centrality.
sort(closeness(g, normalized=TRUE))## GREEDO JABBA HAN OWEN CAMIE TARKIN
## 0.1282051 0.1282051 0.1459854 0.2083333 0.2247191 0.2500000
## OBI-WAN MOTTI R2-D2 BERU WEDGE RED TEN
## 0.2631579 0.2631579 0.2739726 0.2739726 0.2777778 0.2857143
## CHEWBACCA DARTH VADER LUKE C-3PO LEIA DODONNA
## 0.2985075 0.2985075 0.3076923 0.3125000 0.3278689 0.3333333
## GOLD LEADER BIGGS RED LEADER
## 0.3333333 0.3448276 0.3448276
Betweenness a count of the number of shortest paths that flow through the node. This measures how important a node is for connecting disparate parts of a social network. Individuals with a high betweenness centrality are likely to connect largely independent communities. This often highlights individuals that have a greater tendency to change groups than others.
sort(betweenness(g))## CAMIE OWEN OBI-WAN MOTTI TARKIN GREEDO
## 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## JABBA WEDGE GOLD FIVE BERU RED TEN DARTH VADER
## 0.000000 0.000000 0.000000 1.666667 2.200000 15.583333
## CHEWBACCA LUKE R2-D2 GOLD LEADER RED LEADER BIGGS
## 15.916667 18.333333 22.750000 23.800000 31.416667 31.916667
## C-3PO HAN DODONNA LEIA
## 32.783333 37.000000 47.533333 59.950000
Eigenvector centrality the sum of the centralities of an individual’s neighbours. High centrality can be achieved by having either a large degree or being connected to associates with a high degree (or both). This is a useful property as it captures the potential ‘importance’ of individuals in the network, as social hubs, or for the propagation of information or diseases through animal populations.
sort(eigen_centrality(g)$vector)## GOLD FIVE MOTTI TARKIN GREEDO JABBA RED TEN
## 1.839045e-20 9.118469e-03 1.133319e-02 1.154515e-02 1.154515e-02 1.512880e-02
## GOLD LEADER DARTH VADER CAMIE DODONNA WEDGE OWEN
## 1.717615e-02 2.671707e-02 3.064953e-02 3.143121e-02 3.410098e-02 6.256707e-02
## RED LEADER BERU BIGGS R2-D2 OBI-WAN LEIA
## 6.476889e-02 7.063977e-02 7.856101e-02 5.030995e-01 5.417666e-01 5.923767e-01
## C-3PO CHEWBACCA HAN LUKE
## 5.957835e-01 6.577603e-01 8.125507e-01 1.000000e+00
Authority score is another measure of centrality initially applied to the Web. A node has high authority when it is linked by many other nodes that are linking many other nodes.
sort(authority_score(g)$vector)## GOLD FIVE MOTTI TARKIN GREEDO JABBA RED TEN
## 3.874369e-17 9.118469e-03 1.133319e-02 1.154515e-02 1.154515e-02 1.512880e-02
## GOLD LEADER DARTH VADER CAMIE DODONNA WEDGE OWEN
## 1.717615e-02 2.671707e-02 3.064953e-02 3.143121e-02 3.410098e-02 6.256707e-02
## RED LEADER BERU BIGGS R2-D2 OBI-WAN LEIA
## 6.476889e-02 7.063977e-02 7.856101e-02 5.030995e-01 5.417666e-01 5.923767e-01
## C-3PO CHEWBACCA HAN LUKE
## 5.957835e-01 6.577603e-01 8.125507e-01 1.000000e+00
Page rank approximates probability that any message will arrive to a particular node. This algorithm was developed by Google founders, and originally applied to website links. It returns the importance of each node as a proportion. The value for each node is proportional to the importance of all the nodes connected to it. Individuals with a large page rank are disproportionately important for connecting different components of the network, and this measure is likely to be important when investigating flows through networks.
sort(page_rank(g)$vector)## GOLD FIVE JABBA GREEDO RED TEN CAMIE DODONNA
## 0.007092199 0.008310156 0.008310156 0.010573836 0.013792262 0.016185680
## MOTTI GOLD LEADER OWEN BERU WEDGE TARKIN
## 0.016813964 0.017945853 0.018881975 0.020368818 0.026377242 0.034180007
## DARTH VADER RED LEADER BIGGS OBI-WAN R2-D2 LEIA
## 0.034576040 0.034578060 0.035070288 0.067378471 0.068538690 0.086027500
## CHEWBACCA C-3PO HAN LUKE
## 0.086390090 0.088708430 0.114631333 0.185268949
We can visualise all these different measures of centrality.
# Calculate them
deg <- degree(g)
str <- strength(g)
clo <- closeness(g)
bet <- betweenness(g)
eig <- eigen_centrality(g)
aut <- authority_score(g)
ran <- page_rank(g)
# Plot the characters with the highest values of each parameter
par(mar = c(3.1, 8.1, 3.1, 1), mfrow = c(3, 3))
barplot(deg[order(deg, decreasing = TRUE)][1:5], las = 2, horiz = TRUE,
main = "Degree")
barplot(str[order(str, decreasing = TRUE)][1:5], las = 2, horiz = TRUE,
main = "Weighted degree")
barplot(clo[order(clo, decreasing = TRUE)][1:5], las = 2, horiz = TRUE,
main = "Closeness")
barplot(bet[order(bet, decreasing = TRUE)][1:5], las = 2, horiz = TRUE,
main = "Betweenness")
barplot(eig$vector[order(eig$vector, decreasing = TRUE)][1:5], las = 2,
horiz = TRUE, main = "Eigenvector centrality")
barplot(aut$vector[order(aut$vector, decreasing = TRUE)][1:5], las = 2,
horiz = TRUE, main = "Authority score")
barplot(ran$vector[order(ran$vector, decreasing = TRUE)][1:5], las = 2,
horiz = TRUE, main = "Page Rank")Do you spot any difference between the eigenvector centrality, authority score and the page rank?
Which is the most central character in the movie?
Finally, not exactly a measure of centrality, but we can learn more
about who each node is connected to by using the following functions:
neighbors (for direct neighbors) and ego (for
neighbors up to n neighbors away)
neighbors(g, v=which(V(g)$name=="DARTH VADER"))## + 5/22 vertices, named, from e83521e:
## [1] CHEWBACCA LEIA OBI-WAN MOTTI TARKIN
ego(g, order=2, nodes=which(V(g)$name=="DARTH VADER"))## [[1]]
## + 14/22 vertices, named, from e83521e:
## [1] DARTH VADER CHEWBACCA LEIA OBI-WAN MOTTI TARKIN
## [7] R2-D2 C-3PO LUKE HAN DODONNA BIGGS
## [13] BERU RED LEADER
Who are the most relevant characters in the the tv show Game of Thrones, according to different measures of centrality?
All the above mentioned measures of centrality are useful on their own, but the correlation between them can also give us information about the structure of the network. It is important to be careful when interpreting network metrics, as these depend on both the measure used (e.g. degree vs. betweenness) and the edge definition (e.g. rates vs. number of interactions), and on how the network is structured (e.g. if a network is structured into communities, metrics calculated within a single community may be very different from those calculated for the entire network). Because the metrics can have different interpretations on different social networks, it is important to visualise the structure of the network and the correlation structures between different metrics.
In Figure 3, Farine & Whitehead (2015) present two toy networks. The first (a) contains two clusters joined by a single individual. In this network, node 1 provides a bridge between the two clusters, and it has a high betweenness (b). However, node 1 has the lowest degree. Thus, if we were to investigate dynamics of spread, betweenness might provide a better estimate of relative node importance. The second network (c) contains individuals that are more uniformly connected. Individual 1 in this network has both the highest degree and, by far, the highest betweenness. Thus, when interpreting the relative importance of nodes in a network, the relationships between different network metrics may be informative.
Figure 3. The effect of network structure on network metric correlations (Farine & Whitehead, 2015).
To understand the relationship of these different metrics in our data
we can correlate them. To do that, we need to first put all the
centrality metrics together in a single data frame using the function
cbind() and correlate them using the function
cor(). We are going to round the correlation to two
decimals using the function round(,2).
centralities <- cbind(deg, str, bet, eig$vector, aut$vector, ran$vector)
colnames(centralities) <- c("degree", "structure", "betweeness", "eigenvector", "authority", "rank")
round(cor(centralities), 2)## degree structure betweeness eigenvector authority rank
## degree 1.00 0.88 0.66 0.84 0.84 0.90
## structure 0.88 1.00 0.41 0.98 0.98 0.99
## betweeness 0.66 0.41 1.00 0.43 0.43 0.42
## eigenvector 0.84 0.98 0.43 1.00 1.00 0.96
## authority 0.84 0.98 0.43 1.00 1.00 0.96
## rank 0.90 0.99 0.42 0.96 0.96 1.00
Compare the centrality of the Star Wars network with the one from Game of Thrones.
Another important thing we can do with a network is to describe its properties as a whole.
We can start with measures of the size of a network.
diameter is the length of the longest path (in number of
edges) between two nodes. We can use get_diameter to
identify this longest path. mean_distance is the average
number of edges between any two nodes in the network. We can find each
of these paths between pairs of edges with distances.
diameter(g, directed=FALSE, weights=NA)## [1] 3
get_diameter(g, directed=FALSE, weights=NA)## + 4/22 vertices, named, from e83521e:
## [1] DARTH VADER CHEWBACCA C-3PO OWEN
mean_distance(g, directed=FALSE, weights=NA)## [1] 1.909524
dist <- distances(g, weights=NA)
dist[1:5, 1:5]## R2-D2 CHEWBACCA C-3PO LUKE DARTH VADER
## R2-D2 0 1 1 1 2
## CHEWBACCA 1 0 1 1 1
## C-3PO 1 1 0 1 2
## LUKE 1 1 1 0 2
## DARTH VADER 2 1 2 2 0
edge_density is the number of edges in a network divided
by the total possible edges, or the sum of edge weights divided by the
number of possible edges. A potentially important measure for
normalizing observed degree distributions as larger networks tend
towards very low densities.
edge_density(g)## [1] 0.2597403
# 22*21 possible edges / 2 because it's undirected = 231 possible edges
# but only 60 exist
60/((22*21)/2)## [1] 0.2597403
reciprocity measures the propensity of each edge to be a
mutual edge; that is, the probability that if i is
connected to j, j is also connected to
i.
reciprocity(g)## [1] 1
# Why is it 1?transitivity, also known as clustering coefficient,
measures the probability that adjacent nodes of a network are connected.
In other words, if i is connected to j, and
j is connected to k, what is the probability
that i is also connected to k? This is
potentially an important measure, particularly when measuring
interactions, as it captures the level of clustering in the network.
transitivity(g)## [1] 0.5375303
Where do the characters show a larger proportion of connections, in Star Wars or Game of Thrones?
Networks often have different clusters or communities of nodes that are more densely connected to each other than to the rest of the network. Let’s cover some of the different existing methods to identify these communities.
The most straightforward way to partition a network is into connected components. Each component is a group of nodes that are connected to each other, but not to the rest of the nodes. For example, this network has two components.
components(g)## $membership
## R2-D2 CHEWBACCA C-3PO LUKE DARTH VADER CAMIE
## 1 1 1 1 1 1
## BIGGS LEIA BERU OWEN OBI-WAN MOTTI
## 1 1 1 1 1 1
## TARKIN HAN GREEDO JABBA DODONNA GOLD LEADER
## 1 1 1 1 1 1
## WEDGE RED LEADER RED TEN GOLD FIVE
## 1 1 1 2
##
## $csize
## [1] 21 1
##
## $no
## [1] 2
par(mar=c(0,0,0,0))
plot(g)Most networks have a single giant connected component that includes most nodes. Most studies of networks actually focus on the giant component (e.g. the shortest path between nodes in a network with two or more component is Inf!).
giant <- decompose(g)[[1]]
plot(giant)Even within a giant component, there can be different subsets of the network that are more connected to each other than to the rest of the network. The goal of community detection algorithms is to identify these subsets.
There are a few different algorithms, each following a different logic.
The walktrap algorithm finds communities through a series of short random walks. The idea is that these random walks tend to stay within the same community. The length of these random walks is 4 edges by default, but you may want to experiment with different values. The goal of this algorithm is to identify the partition that maximizes a modularity score. The modularity of a graph with respect to some division (or vertex types) measures how good the division is, or how separated are the different vertex types from each other. Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules.
cluster_walktrap(giant)## IGRAPH clustering walktrap, groups: 6, mod: 0.16
## + groups:
## $`1`
## [1] "CAMIE" "BIGGS" "DODONNA" "GOLD LEADER" "WEDGE"
## [6] "RED LEADER" "RED TEN"
##
## $`2`
## [1] "DARTH VADER" "MOTTI" "TARKIN"
##
## $`3`
## [1] "R2-D2" "CHEWBACCA" "C-3PO" "LUKE" "LEIA" "OBI-WAN"
## [7] "HAN"
## + ... omitted several groups/vertices
cluster_walktrap(giant, steps=10)## IGRAPH clustering walktrap, groups: 3, mod: 0.15
## + groups:
## $`1`
## [1] "DARTH VADER" "MOTTI" "TARKIN"
##
## $`2`
## [1] "R2-D2" "CHEWBACCA" "C-3PO" "LUKE" "LEIA" "BERU"
## [7] "OWEN" "OBI-WAN" "HAN" "GREEDO" "JABBA"
##
## $`3`
## [1] "CAMIE" "BIGGS" "DODONNA" "GOLD LEADER" "WEDGE"
## [6] "RED LEADER" "RED TEN"
## + ... omitted several groups/vertices
Other methods are:
cluster_fast_greedy(giant)## IGRAPH clustering fast greedy, groups: 4, mod: 0.17
## + groups:
## $`1`
## [1] "CHEWBACCA" "LUKE" "LEIA" "OBI-WAN" "HAN" "GREEDO"
## [7] "JABBA"
##
## $`2`
## [1] "R2-D2" "C-3PO" "BERU" "OWEN"
##
## $`3`
## [1] "CAMIE" "BIGGS" "DODONNA" "GOLD LEADER" "WEDGE"
## [6] "RED LEADER" "RED TEN"
## + ... omitted several groups/vertices
cluster_edge_betweenness(giant)## IGRAPH clustering edge betweenness, groups: 2, mod: 0.033
## + groups:
## $`1`
## [1] "R2-D2" "CHEWBACCA" "DARTH VADER" "LEIA" "OBI-WAN"
## [6] "MOTTI" "TARKIN" "HAN" "GREEDO" "JABBA"
##
## $`2`
## [1] "C-3PO" "LUKE" "CAMIE" "BIGGS" "BERU"
## [6] "OWEN" "DODONNA" "GOLD LEADER" "WEDGE" "RED LEADER"
## [11] "RED TEN"
##
cluster_infomap(giant)## IGRAPH clustering infomap, groups: 2, mod: 0.064
## + groups:
## $`1`
## [1] "R2-D2" "CHEWBACCA" "C-3PO" "LUKE" "CAMIE"
## [6] "BIGGS" "LEIA" "BERU" "OWEN" "OBI-WAN"
## [11] "HAN" "GREEDO" "JABBA" "DODONNA" "GOLD LEADER"
## [16] "WEDGE" "RED LEADER" "RED TEN"
##
## $`2`
## [1] "DARTH VADER" "MOTTI" "TARKIN"
##
cluster_label_prop(giant)## IGRAPH clustering label propagation, groups: 2, mod: 0.064
## + groups:
## $`1`
## [1] "R2-D2" "CHEWBACCA" "C-3PO" "LUKE" "CAMIE"
## [6] "BIGGS" "LEIA" "BERU" "OWEN" "OBI-WAN"
## [11] "HAN" "GREEDO" "JABBA" "DODONNA" "GOLD LEADER"
## [16] "WEDGE" "RED LEADER" "RED TEN"
##
## $`2`
## [1] "DARTH VADER" "MOTTI" "TARKIN"
##
There is no better or worse algorithm, we will chose one method or another according to our preferences. In general, fastgreedy is the fastest algorithm.
igraph also makes it very easy to plot the resulting
communities:
comm <- cluster_infomap(giant)
modularity(comm) # modularity score## [1] 0.06420569
par(mar=c(0,0,0,0))
plot(comm, giant)Alternatively, we can also add the membership to different
communities as a color parameter in the igraph object. That
is, we use the function membership on our identified
communities, and use this to change their colour.
V(giant)$color <- membership(comm)
par(mar=c(0,0,0,0))
plot(giant)The final way in which we can think about network communities is in terms of hierarchy or structure.
K-core decomposition allows us to identify the core
and the periphery of the network. A k-core is a maximal subnet of a
network such that all nodes have at least degree K. To calculate this we
will use the function coreness.
coreness(g)## R2-D2 CHEWBACCA C-3PO LUKE DARTH VADER CAMIE
## 6 6 6 6 3 2
## BIGGS LEIA BERU OWEN OBI-WAN MOTTI
## 5 6 3 3 6 3
## TARKIN HAN GREEDO JABBA DODONNA GOLD LEADER
## 3 6 1 1 5 5
## WEDGE RED LEADER RED TEN GOLD FIVE
## 5 5 2 0
which(coreness(g)==6) # what is the core of the network?## R2-D2 CHEWBACCA C-3PO LUKE LEIA OBI-WAN HAN
## 1 2 3 4 8 11 14
which(coreness(g)==1) # what is the periphery of the network?## GREEDO JABBA
## 15 16
# Visualizing network structure
V(g)$coreness <- coreness(g)
par(mfrow=c(2, 3), mar=c(0.1,0.1,1,0.1))
set.seed(777)
fr <- layout_with_fr(g)
for (k in 1:6){
V(g)$color <- ifelse(V(g)$coreness>=k, "orange", "grey")
plot(g, main=paste0(k, '-core shell'), layout=fr)
}What communities can you find in the Game of Thrones network? Try different community detection algorithms to see if you get different answers.
Up to this point, we have focused on describing networks, both visually and numerically. In this bonus section we aim to explain how networks emerge: what are the mechanisms that explain the structure of the observed networks?
One of the most basic notions governing network formation is homophily, that is, the propensity of individuals to cluster along common traits, such as age, gender, class, etc. Positive assortment suggests that nodes are more connected than expected, whereas negative assortment suggests avoidance of alike nodes. This can be measured on weighted networks and is a powerful approach for identifying phenotypic structure in social networks. For example, positive assortment by degree (gregariousness) has been linked with rapid spread of information or disease through social networks.
In our data set, we are going to explore the association among the different sides of the force, but first we need to assign the side of each character.
# Assign the attribute side to each of the characters
g <- set_vertex_attr(g, "side",
value = c(rep("light", 4), "dark", rep("light",6), rep("dark",2), "light", rep("other",2), rep("light",6)))
# We can explore that is correct
cbind(V(g)$side, V(g)) %>%
DT::datatable()Now that we have “side of the force” introduced as an attribute we can explore the influence of each side on the network. For example, we can calculate the homophily for the whole network and for each side of the force. We can measure the extent to which a network is homophilic along a specific variable by computing the assortativity coeficient. The assortativity coefficient measures the level of homophyly of the graph, based on some vertex labeling or values assigned to vertices. If the coefficient is high, that means that connected vertices tend to have the same labels or similar assigned values.
# Whole network
assortativity_degree(g, directed=FALSE)## [1] -0.1934052
# By side
assortativity_nominal(g, factor(V(g)$side))## [1] 0.4055202
The main limitation with this approach is that we don’t know to what extent this coefficients are different from what you would find simply by chance in any network. Furthermore, it is hard to disentangle what is the variable that is driving homophily. For example, the proportion of light side is higher than any of the other sides, and thus the homophily result for side could be simply due to a bias in the data.
We can perform a randomization procedure to determine how likely the observed assortativity in our network is given the side of the force of the characters. We can randomly permute the side of the force of the characters in the network 1000 times and recalculate the assortativity for each random network.
# Convert the side of the force attribute into a numeric value
values <- as.numeric(factor(V(g)$side))
# Calculate the observed assortativity
observed.assortativity <- assortativity(g, values)
# Calculate the assortativity of the network randomizing the side attribute 1000 times
results <- vector('list', 1000)
for(i in 1:1000){
results[[i]] <- assortativity(g, sample(values))
}
# Plot the distribution of assortativity values and add a red vertical line at the original observed value
hist(unlist(results))
abline(v = observed.assortativity, col = "red", lty = 3, lwd=2)Given that our value is well outside the permutation area, we can see that our assortativity is not due to chance.
For ease of reproducibility
sessionInfo()## R version 4.2.2 (2022-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.27 forcats_0.5.2 stringr_1.5.0 dplyr_1.0.10
## [5] purrr_1.0.1 readr_2.1.3 tidyr_1.3.0 tibble_3.1.8
## [9] ggplot2_3.4.0 tidyverse_1.3.2 igraph_1.3.5
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-45 lubridate_1.9.1 assertthat_0.2.1
## [4] digest_0.6.31 utf8_1.2.2 R6_2.5.1
## [7] cellranger_1.1.0 backports_1.4.1 reprex_2.0.2
## [10] evaluate_0.20 highr_0.10 httr_1.4.4
## [13] pillar_1.8.1 rlang_1.0.6 googlesheets4_1.0.1
## [16] readxl_1.4.1 rstudioapi_0.14 jquerylib_0.1.4
## [19] Matrix_1.5-1 rmarkdown_2.20 googledrive_2.0.0
## [22] htmlwidgets_1.6.1 munsell_0.5.0 broom_1.0.2
## [25] compiler_4.2.2 modelr_0.1.10 xfun_0.36
## [28] pkgconfig_2.0.3 htmltools_0.5.4 tidyselect_1.2.0
## [31] fansi_1.0.4 crayon_1.5.2 tzdb_0.3.0
## [34] dbplyr_2.3.0 withr_2.5.0 grid_4.2.2
## [37] jsonlite_1.8.4 gtable_0.3.1 lifecycle_1.0.3
## [40] DBI_1.1.3 magrittr_2.0.3 scales_1.2.1
## [43] cli_3.6.0 stringi_1.7.12 cachem_1.0.6
## [46] fs_1.6.0 xml2_1.3.3 bslib_0.4.2
## [49] ellipsis_0.3.2 generics_0.1.3 vctrs_0.5.2
## [52] tools_4.2.2 glue_1.6.2 crosstalk_1.2.0
## [55] hms_1.1.2 fastmap_1.1.0 yaml_2.3.7
## [58] timechange_0.2.0 colorspace_2.1-0 gargle_1.2.1
## [61] rvest_1.0.3 knitr_1.41 haven_2.5.1
## [64] sass_0.4.5
Universitat de Barcelona, pcapdevila@ub.edu↩︎